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Abstract 

Background: Periodic proteins, characterized by the presence of multiple repeats of short motifs, form an interesting 
and seldom-studied group. Due to often extreme divergence in sequence, detection and analysis of such motifs is 
performed more reliably on the structural level. Yet, few algorithms have been developed for the detection and analysis 
of structures of periodic proteins. 

Results: ConSole recognizes modularity in protein contact maps, allowing for precise identification of repeats in solenoid 
protein structures, an important subgroup of periodic proteins. Tests on benchmarks show that ConSole has higher 
recognition accuracy as compared to Raphael, the only other publicly available solenoid structure detection tool. As 
a next step of ConSole analysis, we show how detection of solenoid repeats in structures can be used to improve 
sequence recognition of these motifs and to detect subtle irregularities of repeat lengths in three solenoid protein families. 

Conclusions: The ConSole algorithm provides a fast and accurate tool to recognize solenoid protein structures as a whole 
and to identify individual solenoid repeat units from a structure. ConSole is available as a web-based, interactive server and is 
available for download at http//console.sanfordbumham.org. 
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Background 

Current estimates suggest that approximately 30% of hu- 
man proteins contain multiple repeats of short motifs and 
could be classified as "periodic proteins" [1]. In many cases, 
proteins with such motifs fold into three-dimensional struc- 
tures resembling solenoids (Greek solen (pipe) eidos (form)) 
and thus are called solenoid or solenoid-like proteins. A 
well-known example of solenoid proteins are Leucine Rich 
Repeats (LRRs) present in the innate immunity or receptors 
(NLR or TLR, respectively) and in thousands of other 
proteins with various other functions and extremely vari- 
able consensus sequences [2]. Other examples include 
Ankyrin repeats involved in various protein-protein in- 
teractions and Armadillo repeats that, together with 
other homologous classes, such as HEAT repeats, form 
helical solenoids and are found in proteins involved in 
cell adhesion [3,4]. 

Solenoid proteins evolved by a series of duplications of 
an ancestral motif, but the precise order of duplications is 
often unknown and may differ between and sometimes 
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even within families. Accumulated mutations, deletions, 
and insertions lead to increasing divergence between indi- 
vidual repeats. For many proteins, this divergence can be 
quite extreme with almost no sequence similarity between 
individual copies of the ancestral motif [5]. As a result, so- 
lenoid repeats are often difficult to recognize in sequence, 
for instance Pfam Hidden Markov Models recognize less 
than half of the repeats in NLR and TLR proteins. Hence, 
automated detection of subtle motif variations from se- 
quence is often impossible. 

Because protein structures tend to be more conserved 
than sequences, similarity is retained on the structural level 
and recognition of the repeats is thus easier [6]. Still, re- 
peats have significant variations of length and shape, mak- 
ing the precise recognition of individual solenoid units 
highly nontrivial. For instance, in LRR proteins the length 
of the individual repeats varies between 18 and 34, and not 
a single position, including the leucines forming the telltale 
pattern, is universally conserved in all repeats. The local di- 
vergence of the motifs has consequences on the global- 
structure level. In LRR proteins, the curvature of the entire 
domain varies from an ideal curvature in Ribonuclease In- 
hibitors (RIs) or NLRs [7] to an irregular curvature of TLRs 
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[8], with consequences for the binding properties within 
the inner cavity of the protein. 

Detection of repeats in proteins, both on the sequence 
and structure level, has gained importance as structures of 
more proteins with solenoid repeats have become known. 
Almost simultaneously, the sensitivity of sequence-based 
recognition has improved. Both these trends resulted in 
better appreciation of the relative number of proteins with 
repeats and the importance of the detection problem. 
Various detection algorithms of repeated motifs in protein 
sequences have been developed, with Gibbs sampling [9] 
and RADAR [10] as some of the first, and many others 
have followed [11,12]. Some of them are focused specific- 
ally on solenoid repeats in which Fourier-based analysis 
seems to produce the best results [13,14]. 

To the best of our knowledge, only four detectors of re- 
petitive units in protein structures have been described in 
literature: (i) DAVROS [15] was probably the first method 
for this task, with detection based on a self-alignment 
matrix, (ii) ProStrip [16] performs repeat detection based on 
the C a backbone angles, {Hi) Raphael [6] is specifically de- 
voted to the detection of solenoid repeats and is based on 
repeated Fourier analysis of C a coordinates with appended 
machine-learning classification, and (iv) a hierarchical ap- 
proach based on successive bisection of the structure into 
tiles for self-alignment [17]. Raphael is the best solenoid 
classifier to date as it significantly exceeds solenoid detection 
performance of sequence-based methods, while the hier- 
archical structural analysis from [17] is the most versatile ap- 
proach to detect all possible types of structural repeats. 

Here we present ConSole, a new method to determine the 
presence and specific positions of individual solenoid repeat 
units within protein structures. Template matching, a popu- 
lar image-processing procedure, applied to contact maps de- 
termines whether individual residues are part of a solenoid 
domain or part of a non-solenoid segment. This approach is 
further generalized to classify whether a whole protein struc- 
ture under scrutiny is solenoid or non-solenoid. ConSole is 
assessed on a benchmark dataset and directly compared to 
Raphael, the only publicly available solenoid detection algo- 
rithm. We furthermore demonstrate how accurate detection 
and subsequent structural alignment of solenoid units can 
be used to automatically retrieve the solenoid sequence 
motif from structure. Finally, as an example of a large-scale 
analysis enabled by the development of the ConSole algo- 
rithm, we analyze the length distribution of solenoid units in 
a large number of solenoid-like protein structures to auto- 
matically detect subtle variations of solenoid units in three 
solenoid protein families. 

Methods 

Pattern of solenoid units in contact maps 

Contact maps (CMs) provide a simple but powerful 
means for protein structure comparison and alignment 



[18,19], prediction [20,21] and visualization of protein 
structural features [22]. Here we show how CMs can be 
used to identify solenoid proteins and to calculate lengths 
of individual units, even for very divergent repeats. 

Below, we briefly specify the contact map definition 
used here and explain how solenoid unit lengths are esti- 
mated. Specifically, we define sidechains of residue i and 
/ of all N residues to be in contact if the distance of any 
pair of their heavy atoms a b is below a specified 
threshold: 

c (ij):\ai-aj\ < t (1) 

We use the distance threshold of t = 4.5 A, following 
our earlier applications of contact maps [23], and assign 
a value of 1 (True) or 0 (False) to each of the NxN posi- 
tions on the map. As expected, structural repeats in pro- 
tein structures correspond to repeating patterns in the 
CM. The most striking feature of CMs for solenoid pro- 
tein structures is an almost continuous line d 2 of con- 
tacts running parallel to the main diagonal d 2 . The 
presence of a point on d 2 indicates that a residue from 
one solenoid unit is in contact with a residue in the 
neighboring units (Figure 1). We also tested other con- 
tact definitions (C a , Cp), but they did not reveal alternative 
significant features in the maps for solenoid detection 
other than d 2 . 

We define A as the average repeat length in a solenoid 
protein structure. In contact maps, the distance between 
dx and d 2 is related to the repeat length by a formula 
A = | dx - d 2 | . Because contacts are aligned along the 
main diagonal in the CM, we have to iterate along dx in 
order to determine the most frequent contact length: 

A = argmaxl° =6 Y^=o c 1 + n ^ ^ 

Argmax returns the argument with the maximum 
value of a function. Sampling the complete CM to obtain A 
is not required since A is expected to be in the interval 
[6; 60] of potentially contacting residues. These boundaries 
are based on the fact that contacts shorter than 6 residues 
are within the a-helical contact range and that solenoid 
unit lengths beyond 60 residues are virtually nonexistent 
[1]. Lengths of solenoid repeats are typically in the [12; 45] 
interval. Repeat lengths X t of individual solenoids unit span- 
ning over a short segment [i;i + AJ can also be calculated 
when the detection in equation 2 is confined to [i; i + A]. 

Rule-based classification of solenoids vs. non-solenoids 

In many solenoid proteins, regular repeats are inter- 
rupted by insertions that are not part of the solenoid. 
We developed a rule-based classifier analyzing only con- 
tact information to detect whether a residue is or is not 
a part of a solenoid unit. It mimics a human approach 
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Figure 1 Contact maps of solenoid protein structures. Contact 
maps of two solenoid protein structures. A line d 2l parallel to the 
main diagonal, is clearly visible in both CMs. d 2 is almost fully continuous 
for the highly regular Ribonuclease Inhibitor (1DFJ — Chain I) and has 
some gaps in the Clathrin structure (1B89 — Chain A), reflecting more 
variable interaction patterns between helices. 



on detecting solenoid units in contact maps. This classi- 
fier is based on sampling each line in the contact map 
for each residue i, starting at the main diagonal [i,i] for 
each sample. The first step toward determining whether 
residue i is or is not part of the solenoid unit was to find 
gaps within d 2 . Gaps in d 2 indicate insertions where the 
given segment does not interact with the next turn of 
the solenoid. We defined significant contact gaps to be 



at least 5 residues long, without any upper margin. A 
gap starting at residue i is defined as: 

gap{i) = if Y/e [/; i + 5], V/ce [/ + 6; / + 60] c (/, k) 

(3) 

In order to identify solenoid units with variable 
lengths, we iterate over the CM and analyze each residue 
i and the following residues within a window 1 = [i;i + X], 
We assign a solenoid unit starting at residue i by: 



solenoid (i) = if V/e/:-, gap (J) A 0.5 -A < A,- < 2 • X 



(4) 



Then, we reassess each solenoid unit by determining 
the individual unit length X t with equation 2. We anno- 
tate a solenoid unit starting at i and ending at i + X t if 
solenoid(i) is true. The algorithm then continues at 
i = i + \ t + 1. If solenoid(i) is false, however, we continue 
either sti = i + l or at the end of a gap. 

Template matching and SVM-based classification of solenoids 
vs. non-solenoids 

The core algorithm implemented in ConSole is based on 
image-processing methods to detect solenoids and non- 
solenoid regions in protein structures. For this, we apply 
a template-matching algorithm to the contact map and 
classify the resulting scores with a trained Support Vec- 
tor Machine (SVM). 

Template matching in a contact map 

Template matching (TM) is a popular image-processing 
technique allowing one to find specific patterns (P) in 
images (I) or other multidimensional data. A standard 
approach in TM is to use normalized cross correlation 
(NCC) in order to find potential areas resembling the 
searched pattern P. I and P must not necessarily match 
in size. Moreover, it is rather common that P is signifi- 
cantly smaller than / with accurate normalization ac- 
counting for the size difference. The NCC is typically 
defined as: 



NCC (/, S) x 



x + i, y + 



(5) 



where S is the mean and o s the standard deviation of «S, 
I Xi y is the mean and 07 the standard deviation of a 
region around x, y in / with the same size as S. NCC 
returns a matrix containing correlation coefficients in 
the range of [-1; 1]. A result of 1 indicates a perfect 
match, 0 indicates no similarity, and -1 indicates inverse 
similarity [24]. 

Images and CMs are usually represented as a matrix of 
NxM pixels, and, hence, NCC can be used to localize 
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any kind of pattern within a CM. In order to determine 
whether a residue is part of a solenoid or not, two patterns 
are correlated with the structures CM: (i) one pattern 
representing the typical solenoid contact pattern with two 
parallel diagonals, resulting in a correlation matrix Mj and 
(it) one pattern with only one main diagonal for non- 
solenoids, resulting in a correlation matrix M 2 (Figure 2). 

Both patterns are generated dynamically at runtime. 
The pattern size in the x and y dimensions are set to 2\ 



in order to accommodate d 2 fully in the solenoid pat- 
tern. This way, both patterns used in the analysis are 
adapted to the specific solenoid length of the given 
structure. 

Support vector classification of correlation features 

The Support Vector Machine is a machine-learning 
method used for supervised classification in many com- 
putational disciplines [25]. It is especially renowned for 




0 100 200 300 400 




-0.1 I 1 -0.4 I 

0 10 20 30 40 0 10 20 30 40 



Figure 2 Solenoid contact patterns, (a) The correlation matrix M, determined for template matching the contact map of PDB 1 DFJ-Chain I 
determined with the solenoid pattern (b). Bright regions indicate high correlation values [0;1], while dark regions indicate low [-1; 0] correlation 
values, (b) The solenoid pattern to generate M 1 and (c) the non-solenoid pattern to generate M 2 used for template matching, (d) Plot of the 
correlation coefficients determined for residue 75 in chain I of 1DFJ. Twenty correlation coefficients were collected around the main diagonal 
[(75,55); (75,95)] from M 1 and merged with 20 coefficients from M 2 from the same positions, (e) Coefficients determined for residue 81 in chain B 
of the globular structure from 1QGC. Correlation features determined for solenoid residues have smoother profiles compared to rather noisy features 
of globular proteins. Peak correlation values also differ for the solenoid/non-solenoid combinations, respectively. 

V J 
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being able to classify multidimensional data while main- 
taining a low error rate based on its maximum margin 
hyper-plane determined during training. 

In ConSole, we make use of the SVM to assign residues 
to solenoid or non-solenoid classes according to their pre- 
viously determined correlation coefficients. We therefore 
collect correlation coefficients around the main diagonal 
from the NCC results as shown in Figure 2. Feature vec- 
tors are generated by concatenating 20 correlation coeffi- 
cients from Mj with 20 correlation coefficients from M 2 
(Figure 2). All coefficients were extracted from their re- 
spective matrices at the positions [(i,i - 10);(i,i + 10)]. Vis- 
ual inspection of the feature vectors clearly indicated 
significant differences between both correlation features 
for the same CM regions. We observed smooth correlation 
peaks for solenoid segments while correlation features of 
globular proteins had rather noisy shapes. Conclusively, 
the shape of the correlation peaks provides a characteristic 
feature for automated classification. Class labels were avail- 
able from the corresponding benchmark annotation and 
assigned to each feature vector prior to SVM training. 

Final classification of structures 

In order to compare classification results to the results in 
literature, we extend classification of individual residues 
and repeats to the level of complete structures using a 
simple threshold measure. If the ratio of the total number 
of residues classified as being in solenoid units to the 
number of all residues exceeds the threshold value, the 
whole structure is classified to be a solenoid (Equation 6): 



We extract the common core determined by POSA to 
build a sequence alignment from the respective common 
core overlaps [28]. Finally, we use Weblogo to visualize 
the consensus motif [29] for the repeat. 

Solenoid benchmark data 

We used a previously published test dataset to assess the 
accuracy of ConSole. This dataset was originally estab- 
lished for testing sequence repeat detectors [13] and has 
since been used as a benchmark for both sequence [14] 
and structure based repeat detectors [6]. 

The benchmark comprises 105 solenoid structures for 
which A, solenoid and non-solenoid residues, have been 
manually annotated. A total of 247 non-solenoid struc- 
tures were also included in this dataset to provide a large 
variety of non-solenoid samples. The dataset contains 
80,347 residues in total, out of which 19,197 were anno- 
tated as being part of solenoid repeats. 

Implementation 

All the algorithms described here were implemented in 
Python, utilizing additional packages such as Biopython 
[30] for accessing PDB files, PyTom [31] for correlation 
functions and parallel processing on multiple CPUs, and 
Scikit [32] to interface with the machine-learning algo- 
rithms. The algorithm used on the server is also avail- 
able for download from the server page http: //console. 
sanfordburnham.org. Residue classification results are 
available in XML format containing solenoid unit boun- 
daries for further analysis. 



solenoid {structure) 



# solenoids 

# residues 



> t 



(6) 



Setting the t value to 0.5 provided the best agreement 
with benchmark results. A detailed assessment of diffe- 
rent t values is presented in the Additional file 1. 

Detection of solenoid sequence-motifs by solenoid unit 
alignment 

We extended the solenoid detection algorithm with an 
automated feature to recognize individual solenoid mo- 
tifs. It is based on the local X t value where units include 
all residues with the indexes in [i; i+\i -1]. We extend 
the usage of equation 6 to measure the quality of each 
solenoid unit and accept units as being solenoids only if 
their solenoid abundance solenoid([i; i + \i -1]) is larger 
than 0.75. If solenoid([i; i + Ai -1]) < 0.75 we continue 
with the next residue i + 1. This condition prevents be- 
ginnings or ends of non-solenoid regions from contrib- 
uting to the motif detection. 

In order to improve identification of consensus motifs, 
we perform structural alignment of all units using rigid 
alignment in the FATCAT [26] and POSA [27] pipeline. 



Results and discussion 

Figures of merit 

Based on the benchmark dataset, we were able to use 
annotations of (i) solenoid repeat lengths to evaluate so- 
lenoid length detection and (ii) predefined residue labels 
to evaluate classification results. Hence, we were able to 
determine true-positive (TP), true-negative (TN), 
false-positive (FP), and false-negative (FN) rates. Fur- 
thermore, sensitivity: Tp T + FN , precision: Tp T ? FN and 
accuracy: TP + IpXfn + tn were determined for the so- 
lenoid and the non-solenoid class, respectively. Our 
final figure of merit for all algorithms was the Matthews 
correlation coefficient [33]: 

TP x TN -FP x FN 
yJ(TP + FP) (TP + FN) {TN + FP) (TN + FN) 



Solenoid unit length estimates 

The fidelity of our solenoid detector was tested on each 
structure from the benchmark dataset. Each automatic- 
ally detected A was compared to the manually annotated 
value. The accuracy was determined to a mean standard 
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deviation of 2.6 residues. The original annotators postu- 
late that an error tolerance of up to 5 residues is accept- 
able for structural solenoid detectors [6], so the accuracy 
of our method is well within the tolerance level. 

Assessing automated solenoid classification 

First, classification of residues to the solenoid or non- 
solenoid class was assessed for a random classifier. The 
underlying random distribution was adjusted to the dis- 
tribution in the benchmark annotation of all residues, 
resulting in a distribution such that -23% of all residues 
were annotated as solenoids while the remaining -77% 
were non-solenoid residues. 

A total of 80,347 random draws from this distribution 
were used to calculate the baseline performance for both 
of our classifiers. In the random assignments, an average 
of -63% of all residues were assigned correctly while -37% 
were false assignments. More precisely, 4,515 solenoids and 
46,695 non-solenoids were predicted correctly. The MCC 
of the random residue classification was -0.008. Extending 
this random assignment with equation 6 and t = 0.33 or 
t = 0.5 to the level of whole structures failed to detect any 
solenoid structure correctly. 

Next, we assessed the classification fidelity of our rule- 
based classifier (Table 1). This classifier showed rela- 
tively high sensitivity and precision for solenoid residues. 
However, results for non-solenoid residues were low, 
resulting in an MCC of 0.11. Extending solenoid classifi- 
cation with equation 6 and t = 0.5 to the whole structure 
level revealed an MCC = 0.46. 

Finally, the classification fidelity of ConSole was deter- 
mined at both the residue and whole-structure level 
(Table 1). Consistent with previous studies, a leave-one- 
out cross validation of the classifier was performed: all fea- 
tures from one structure were excluded for SVM training, 
and this structure was scrutinized [6]. Results of this cross 
validation determined the Matthews correlation coefficient 
for the residue classification to be 0.59. Figure 3 shows 
classification results of four selected structures, while all 
results on the benchmark data are visualized online. De- 
tails on the SVM training parameters and our trials with 
other classifiers (pure SVM contact classification, Decis- 

Table 1 Benchmark results of various solenoid classifiers 



ion Tree correlation classification) are provided in the 
Additional file 1. 

In order to compare ConSole classification to other 
methods, we generalized classification to entire protein 
structures based on equation 6. Results of this general- 
ized classification are also presented in Table 1, and the 
Matthews correlation coefficient was determined here to 
be 0.91. Based on the results published for Raphael, the 
MCC was determined to be 0.87 for SVM value > 0 and 
0.89 for SVM value > 1. 

Additional file 2: Figure SI in the presents the ROC 
curve of whole-structure classification and provides an 
additional means for direct comparison to Raphael. 

Solenoid consensus motif from unit alignments 

Detecting solenoid motifs in sequence is difficult be- 
cause (i) the length of a solenoid repeat is typically short, 
increasing the signal-to-noise ratio as compared to the 
typical domains and full-length proteins, and (ii) se- 
quence similarity may be too weak for detection of very 
divergent repeats. Pfam profiles for solenoid families 
such as LRR, Armadillo, or Ankyrin try to address these 
problems by defining HMMs consisting of several repeat 
units for divergent family members. For instance, the 
LRR profile for LRR1 (PF00560) has a length of 22 resi- 
dues that is in accordance with the primary repeat interval 
[12-45]— class 3 in [34]. However, the HMM for LRR5 
(PF13306) has a length of 129 residues, encompassing ap- 
proximately five repeats of the actual motif. This approach 
is used for other solenoid families: Ankyrin HMMs: 
PF00023— 33 residues and PF12796— 89 residues (3x 
motif repeat), Armadillo/ HEAT: PF02985— 31 residues 
and PF13646— 88 residues (3x motif repeat), and others. 
While improving the recognition sensitivity, this approach 
is inconsistent and leads to confusing results, where simul- 
taneous high-significance matches to overlapping HMMs 
of different lengths are possible. 

We processed structures from the LRR, Ankyrin, and 
Armadillo/HEAT families and determined their respective 
sequence motifs (Figure 4). The motifs obtained with Con- 
Sole were in excellent agreement with motifs available in 
literature. For instance, the sequence motif determined for 



1 



Rule based Rule based ConSole ConSole Raphael S > 0 Raphael S > 

Residue Structure Residue Structure Structure Structure 

Sol. NSol. Sol. NSol. Sol. NSol. Sol. NSol. Sol. Sol. 

Sensitivity 74% 40% 55% 88% 72% 88% 87% 100% 89% 93% 

Precision 89% 19% 65% 83% 66% 90% 100% 95% 86% 98% 

Accuracy 69% 79% 84% 96% 94% 96% 

MCC 0.11 0.46 0.59 0.91 0.87 0.89 



Evaluation of solenoid/non-solenoid classification accuracies determined for all solenoid classifiers mentioned. Sensitivity, precision, accuracy, and MCC were 
determined on a standard benchmark dataset of 105 solenoid (Sol.) and 243 non-solenoid (NSol.) protein structures. Results were determined either for individual 
residues (Residue level) or for whole structures (Structure level). 
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Figure 3 Results of solenoid classification. Four solenoid protein structures taken from the benchmark dataset: (a) 1XKU-A, (b) 1QRL-A, (c) 
1M8Z-A, and (d) 1K5C-A. All are colored according to the ConSole-based classification results. Residues correctly assigned to the solenoid class 
are colored green; residues correctly assigned to the non-solenoid class are colored red. Gray or yellow indicates all residues wrongly assigned to 
the non-solenoid or solenoid class, respectively. Figures for all results in the benchmark are available online. 




Figure 4 Solenoid motif from structure alignment, (a) Automatically detected solenoid units of a Leucine Rich Repeat domain (1DFJ-I) with 
arbitrary solenoid unit coloring. The middle inset shows all units superimposed after multiple-structure alignment with POSA. The consensus motif 
determined for the structurally aligned units is displayed on the right. The next two rows display the same results for (b) an Armadillo repeat 
(1M8Z-A) and (c) an Ankyrin repeat (3B95-A). 

V J 
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the Ribonuclease Inhibitor (RI) was almost identical with 
the RI typical sequence [35]. The only difference to the 
known motif was the start position in sequence, which is 
an arbitrary parameter. To obtain a perfect alignment of 
our motif to the standard motif, the motif sequence had to 
be shifted by 17 residues. 

Unknown solenoid structures in the PDB 

Many protein coordinate sets in the PDB are not de- 
scribed in a peer-reviewed manuscript and also often 
lack any significant annotations. To identify such pro- 
teins, we parsed all PDB headers for the keywords "JRNL 
REF TO BE PUBLISHED," which resulted in a large set 
(16,114) of structures. In the next step, we applied Con- 
Sole to identify novel, perhaps unrecognized solenoid 
protein structures within this set. Indeed, 132 structures 
from this set were classified as solenoids. 

Next, the sequence similarity of each protein against the 
complete collection of PDB proteins was determined. Here 
we ruled out homologs of proteins that have already been 
annotated as solenoids. The search for already known so- 
lenoid homologs was furthermore extended to the Pfam 
database, eliminating proteins mapping to known solenoid 
proteins such as Ankyrin (Pfam: 00023), Armadillo (Pfam: 
00514), or Leucine Rich Repeat (Pfam: 00022). 

Nineteen solenoid structures remained after these steps. 
Many of them were TIM barrels, identified here as sole- 
noids because the torus-like structure also produces the 
second diagonal feature in contact maps. Hence, they are 
sometimes referred to as "closed" solenoids [1,36]. 

Among the remaining true solenoid protein structures 
we observed were a few interesting LRR domains with an 
unusual flat structure (PDB: 4FD0, Bacteriodes caccae) or 
two flat domains connected by a kink (PDB: 4H09, Eubac- 
terium ventriosurn), both highly divergent bacterial LRR 
proteins. Their consensus motifs show an interesting over- 
abundance of phenylalanine residues that could be linked 
to their atypical, flat structures by stacking interactions 
(Figure 5). The conserved region of the sequence motif de- 
termined for 4FD0 (LxxLxLxxLxxL) differs from the clas- 
sical conserved motif region because there was no 
significant alignment of Asparagine-like residues (N) as in 
the standard LRR motif— (LxxLxLxxNxL). Moreover, this 
conserved motif matches with the TpLRR conserved se- 
quence motif, and 4FD0 is probably one of the first struc- 
tures ever to be determined for this LRR family [35]. 

Another unrecognized solenoid protein structure was 
a hypothetical protein from B. thetaiotaomicron (Uni- 
prot: A7LZL0, PDB: 3N6Z). Interestingly, a domain 
homologous to this protein is found in one of the classes 
of immunoglobulin Al proteases, where it overlaps with 
an N-terminal immunoglobulin Al protease domain. 
This domain was not known to consist of repeats, but 
detailed analysis of the automatically identified repeats 
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Figure 5 An unusually flat LRR structure, (a) An unusually flat 
LRR (4FD0-A) found within the few solenoid structures that have not 
yet been mentioned in publications, (b) Structural alignment of 
individual solenoid units with phenylalanine residues colored in 
blue, (c) Sequences of the respective solenoid units where all 
phenylalanines are highlighted in blue and all leucine-like residues 
are highlighted in green. Below is the sequence motif determined 
by ConSole. 



performed as described in the previous paragraph sug- 
gests that repeats in this domain are distantly related to 
GLUG repeats. GLUG is found in other classes of im- 
munoglobulin Al proteases, suggesting that the different 
classes could actually be distantly homologous. 

Analysis of solenoid unit length distributions in solenoid 
families 

Solenoid-like protein structures, by their very nature, gener- 
ally show a high degree of structural regularity. However, 
subtle variations at the level of individual solenoid units are 
possible, with accumulated mutations, deletions or inser- 
tions altering the length and shape of individual units. Such 
small local irregularities can add up to very significant 
structural differences between entire proteins and are im- 
portant for functional adaptations of individual proteins. 

Reliable and reproducible detection of such subtle ir- 
regularities in unit lengths for whole protein families is 
impossible by manual analysis. Hence, we used ConSole to 
automatically analyze the Leucine rich repeat, Ankyrin re- 
peat and Armadillo repeat families for length irregularities 
of solenoid units. The structures were obtained from a 
representative set of PDB structures clustered at 90% se- 
quence identity, a total of 140 structures (396 chains) for 
the LRR family, 107 structures (281 chains) for the 
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Ankyrin and 37 structures (100 chains) for the Armadillo 
repeats. 

Units were assigned to irregularity classes based on the 
unit-length irregularity measured by \Xj - X | , where ; is the 
index of a respective solenoid unit. We now can associate 
length regularity profiles with the regularity of a whole 
structure (Figure 6). Flat profiles indicate a very regular 
structure, while peaks in the profile indicate positions of 
unusual variations in solenoid unit lengths and, as a con- 
sequence, a divergent/irregular overall structure. 

For instance, the unit-length irregularity distribution 
of the LRR structures revealed that 57% of all analyzed 
solenoid units are highly regular (Figure 7) and, such as 
in the case of Ribonuclease Inhibitor (1DFJ), result in a 
regular, torus-like structure. However, large deviations 
from the mean length A were observed in many struc- 
tures, such as in a structure of the TLR4 extracellular do- 
main (2Z64) where irregularities result in a horseshoe-like 
structure with varied curvature. This irregularity provides 
TLRs with the ability to adjust their shape to bind differ- 
ent ligands [37] . 



We show that Ankyrin repeats are the most regular 
among the three families we analyzed here, with no devi- 
ation from A for approximately 75% of all solenoid units 
(Figure 7). On the other hand, the Armadillo repeats 
turns out to be the most irregular with 23% of all solen- 
oid units being at least two residues off from the average 
length A. 

Conclusions 

In this work we present ConSole, an algorithm based on 
a novel combination of contact map analysis and image- 
processing algorithms that focuses on recognition of so- 
lenoid repeats in structures of periodic proteins. 

Contact maps are naturally suited for solenoid recog- 
nition because of the presence of a characteristic line 
parallel to the main diagonal in the contact map. Albeit 
being the most intuitive approach for solenoid unit de- 
tection, direct analysis of contacts did not provide the 
desired accuracy of repeat recognition. 

To improve the recognition, we used a standard tech- 
nique of template matching in image analysis based on 




Figure 6 Unit-length irregularities in Leucine Rich Repeat structures, (a) The structure of the Ribonuclease Inhibitor (1DFJ-I) shows only 
minimal irregularity. The irregularity profile indicates variations in unit length Xj (blue curve) and absolute deviation | A - Xj | from the mean unit 
length X for each solenoid unit j (green curve). The two structural motifs at the bottom represent the two main structural solenoid unit motifs 
detected for 1 DFJ. Possible positions of these structural motifs are indicated above the respective structure, (b) The LRR domain in CARMIL (A) 
resembles an unusually irregular Ribonuclease Inhibitor structure for which variations in the irregularity profile indicate variability in solenoid units. The 
structural segment of units 6-9 (dashed box) depicts irregularities in units in more detail. Each unit is colored according to the irregularity distribution 
in Figure 7. Gray segments were classified as insertions by ConSole and do not contribute to the irregularity calculation, (c) The structure of TLR4 (2Z64-A) 
has a distinctively more irregular region (units 5-10, dashed box) when compared to other segments of the structure. Irregularity in this region accrues a 
significant change in the curvature. 
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Figure 7 Distribution of unit-length irregularities over several solenoid families. Structures from the Leucine Rich Repeat, Ankyrin Repeat 
and Armadillo Repeat were sampled for irregularities. Displayed is the length irregularity distribution for each respective family, where the number of 
residues a solenoid unit length differed from A. 



successive cross correlations of dynamically generated 
solenoid and non-solenoid patterns. The further classifi- 
cation of the computed correlation coefficients with a 
support vector machine allowed high-accuracy solenoid 
classification as measured by the MCC on the standard 
solenoid recognition benchmark. 

ConSole is both more accurate and much faster than 
any available solenoid classifier. However, there are still 
a few examples of solenoid protein structures that pose 
challenges for the current implementation. Most notable 
are protein structures with non-solenoid segments run- 
ning close to the solenoid domain. Such non-solenoid 
segments alter the contact patterns in a way that leads 
ConSole to classify neighboring solenoid residues as 
non-solenoids. An example of such a structure is the 
structure of gamma carbonic anhydrase (1QRL). Another 
factor for false classification results were false-negative 
classifications of complete solenoid units encapsulating 
long insertions (4ECO). While the insertion segment was 
classified correctly as a non-solenoid, residues in solenoid 
units in contact with the insertion were wrongly classified 
as non-solenoids. 

One interesting application of ConSole is to analyze 
individual solenoid units and retrieve their consensus 
motifs from structural alignments. As we demonstrated, 
this application is robust enough to be integrated in a 
completely automated pipeline. We proved that se- 
paration of individual solenoid units and subsequent 
multiple structure alignment reliably detects solenoid 
specific motifs. Consensus motifs stemming from dis- 
tinctive solenoid families were retrieved successfully for 
individual structures and indicate that current Pfam 
HMMs for solenoids were trained using sequences that 
were too long. 

Finally, we extended ConSole analysis from individual 
structures to large groups of proteins in order to analyze 
the extent of structural irregularities within each family. 
Such local irregularities are correlated with function differ- 
ences between homologs from the same family, such as a 
difference between Ribonuclease Inhibitor-like, regular 



and TLR receptors, the irregular members of the LRR 
family. We were also able to compare the irregularity pat- 
terns and show that Ankyrin structures generally are more 
regular than LRRs and Armadillo repeats. 

Thus, we believe that ConSole would be useful for fur- 
ther sequence- or structure-based analysis of solenoid 
proteins as it allows the user to reliably identify consen- 
sus motifs and to detect structural irregularities, leading 
either to developing more accurate motif definitions or 
to structure analysis of individual units and detecting 
their variations. 

Availability of supporting data 

The data sets supporting the results of this article are 
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